ABSTRACT 

A basic understanding of criterion-referenced test 
(CRT) use and development is presented with definitions and 
characteristics of CRTs. The steps necessary to construct and 
validate a CRT, the appropriate use of CRTs, the historical 
development of CRTs, and terms used in conjunction with CRTs are 
discussed. The most common definition of a CRT is a sample of items 
yielding information directly interpretable with respect to a 
well-defined domain of tasks and specified performance standards. The 
major differences from norm-referenced tests are in development and 
interpretation. If a test has not been normed and the test items 
assess performance specified in objectives, it is a CRT. CRTs test 
only what has been taught and are curriculum aligned. CRTs are used 
to obtain examinee scores with some absolute meaning relative to a 
district curriculum. Validity, reliability, discrimination, and the 
degree of difficulty of CRTs are discussed. CRTs must be field tested 
on repeated occasions if initial use produces undesirable statistical 
properties. The possibility of repeated testing should be built into 
any plan for CRT development. A 38-item reference list is included. 
(SLD) 

Abstract 

A confusing array of terms used to describe 
Criterion-Referenced tests and their development has led to 
this attempt at reconciling meaning and interpretation. This 
paper describes the steps necessary to construct and 
validate a CRT, describes appropriate use of CRT's, reviews 
the historical development of CRT's and clarifies the terms 
used in conjunction with CRT's. A substantial bibliography 
citing both technical and popular sources is included. 

Educator's Field Guide to CRT 

Although the term Criterion-Referenced Measure was coined by 
Robert Glaser in a paper in 1962 (Glaser & Klaus, 1962) 
and substantial rhetoric has appeared in the literature 
since that date, no single source has attempted to bring 
together the practical applications and technical features 
of CRT's for examination by educational practitioners. Lathrop 
(1966), for example, concluded that, "Although many 
technical articles have been written concerning the 
reliability of competency tests, few practitioners appear to 
find such discussions helpful." p. 234. Discussions of 
validity and CRT item construction have been similarly 
unenlightening. 

A variety of terms with essentially similar meanings have 
emerged in conjunction with Criterion-Referenced Testing. 
Programs with such varied titles as Competency Based, 
Learning for Mastery, Data Based Instruction, Mastery 
Learning, Group Based Mastery Learning and Outcome Based are 
all grounded in CRT technology. A lack of uniformity and an 
array of terms with only minor variations in definition have 
created a somewhat confusing picture of CRT application. 

This paper is an attempt to clarify CRT terminology as well 
as to offer practitioners a basic understanding of the 
current state of CRT use and development. It begins with 
basic definitions and characteristics of CRT's, describes 
development of CRT's, discusses the technical issues related 
to CRT's in laypersons' terms, summarizes successful 
applications of CRT use and offers a suggestion for solving 
one of the more controversial issues surrounding CRT use. A 
bibliography with both popular and technical sources is 
referenced. 

Basic Definitions 

The term Criterion-Referenced Measurement was originally 
defined as a measure of student performance on a 
hierarchically arranged achievement continuum. This continuum 
was organized around psychological and developmental 
variables associated with each content area and maximized 
the probability of identifying a student's skill level 
within the pre-defined continuum. The student's test score 
was then interpretable as evidence of skill attainment 
within the hierarchy and provided direct evidence of skills 
mastered with implications for prescriptions for future 
development. This perspective on achievement testing was 
introduced in response to dissatisfaction with the almost 
exclusive reliance on 'Norm-Referenced' measures that 
identified a student's skill level in terms of 'Normal' 
development of students with similar educational experience. 
Norm-Referenced measures ignore the concept of an 
'Achievement Continuum' and focus rather on 'Population' 
performance. The score obtained on a Norm-Referenced measure 
may be interpreted only as a comparison with the performance 
of other, similar individuals. This score has little 
personalized diagnostic or prescriptive value. 

Criterion-Referenced Tests have lost some of their original 
intent in current usage and, rather than focusing on an 
'Achievement Continuum', tend to focus on specific programs, 
curricular sequences and/or objectives. This modified 
adaptation of CRT's has given rise to the terms Curriculum 
Referenced Test and/or Program Referenced Test. Technically 
speaking, Curriculum Referenced is a more appropriate 
designation for many of the current school testing 
applications. Because CRT's have been modified in current 
applications, they no longer contain the precise 
characteristics originally proposed by Glaser and although 
there remains considerable similarity between CRT's and 
Curriculum Referenced tests, if one wishes to be technically 
precise, one would distinguish between the two terms. The 
major difference between CRT's and Curriculum Referenced Tests 
is the source of objectives: CRT's begin with a psychological 
and/or developmental continuum of skills while Curriculum 
Referenced Tests begin with specific courses of study. One 
could easily argue that courses of study are psychologically 
and/or developmentally arranged hierarchies and that the 
distinction between CRT and Curriculum Referenced is 
relatively meaningless. This paper continues to use the term 
CRT while describing curriculum referenced applications. 

The most common current definition of CRT is, "... a 
sample of items yielding information that is interpretable 
directly with respect both to a well defined domain of tasks 
and to specified performance standards." (Tindal, et al., 
1985, p. 203) The tasks for which performance standards are 
specified are derived from Competencies or Behavioral 
Objectives. These are the educational goals obtained from 
curriculum guides and/or written by district personnel. 
There is little, except semantic preference, to distinguish 
between competency and objective. Both define educational 
goals in terms of learner performance and both require 
elaboration for translation into CRT items. 

The tasks most frequently subjected to CRT scrutiny are 
called End Point, Terminal or Outcome objectives. That is, 
the major outcomes in a grade/subject are defined and 
converted into student performance statements. Again, there 
appear to be no distinguishing differences among the terms 
'Endpoint', 'Terminal' and 'Outcome'. 

Mastery, Competence and Proficiency are words used 
interchangeably to describe student attainment of behaviors 
implied in the objectives statement. Again, use of these 
terms is subject only to semantic preference. 

Endpoint or terminal objectives are often further defined by 
Enabling or Prerequisite objectives. Enabling objectives are 
derived from a Task Analysis of the terminal objective. That 
is, each endpoint objective is broken down into its 
component behaviors. These enabling objectives are often 
considered the major focus of instruction with decisions on 
how best to present and attain these prerequisite behaviors 
left up to teachers while endpoint objectives are the end of 
instruction standards set by a school district. 
Administrators advocating the adoption of CRT technology 
frequently state, "We are not telling you how to teach, only 
what to teach." or "We have specified the endpoint 
behaviors, it's up to teachers to determine how best to get 
there." 

CRT Characteristics 

It is impossible to distinguish between norm-referenced and 
CRT items based merely on appearance. Exactly the same items 
could appear on either test. The major differences are in 
development and interpretation. If the test has been 
"Normed" then interpretation of scores is based on 
comparisons with other individuals of similar age and 
educational experience. If the test has not been normed AND 
the test items assess performance specified in objectives, 
then it is a CRT. 

CRT's reflect only the educational program adopted by the 
district. Norm-referenced tests, on the other hand, assess 
the average of all state curriculum guides and textbook 
publishers' scope and sequence charts. McGraw-Hill (ORBIT, 
1984) has stated in the rationale for its support of CRT's 
that, "... a growing trend toward individualized, 
objectives-based instruction has uncovered a need for a 
measure of student performance relative to curriculum that 
is more precise than that afforded by norm-referenced 
tests." p. 1. 

CRT's possess Curricular Alignment. They test only what has 
been taught, based on an assumption that whatever gets 
measured gets taught. This raises a question of the stated 
objectives becoming the minimum program and perhaps 
restricting the development of higher level and creative 
thought processes. This could, of course, happen unless 
higher level and creative processes are specified as outcome 
behaviors. 

Because CRT's are aligned with the curriculum, they have the 
capacity to provide diagnostic information for teacher use 
in instructional grouping decisions. One body of research 
has demonstrated that frequent monitoring of students in 
highly focused instructional settings produces significant 
achievement gains. Although these findings have more 
recently been challenged, there remains a considerable body 
of evidence supporting the use of frequent student 
monitoring. 

CRT Use 

The question often arises, "Should I be using a CRT or a 
norm referenced test?" The answer is simply, "What purpose 
do you wish to serve?" If the purpose is to compare examinee 
performance with other, similar individuals, then use a norm 
referenced test. If the purpose, on the other hand, is to obtain 
an examinee score that has some absolute meaning relative to 
a district curriculum, then use a CRT. Districts may wish to use 
both a norm-referenced test to compare student performance 
with national norms and CRT's to identify specific student 
needs in terms of district curricula. If over-testing 
becomes an issue, norm-referenced tests may be used with 
only a sample of the student population annually and still 
provide an accurate estimate of district achievement 
parameters. That is, alternate grades or classrooms may be 
tested with norm-referenced tests and still provide accurate 
district norms. 

This does bring up the question of testing in general. 
Testing is carried out to demonstrate accountability 
and/or to provide information for decision making at the 
individual classroom, building and/or district level. 
Practically, testing is designed to produce information for 
decision making and as long as test data is legitimately being 
used to make decisions, then that testing is necessary and 
appropriate. School boards often insist on norm-referenced 
tests to demonstrate accountability. Seldom is this data 
used for decision making and it may, in fact, interfere with 
more appropriate diagnostic data collection. 

CRT's may be used for diagnostic purposes during the course 
of daily instruction, for end of year 
placement/promotion decisions, as a graduation requirement 
and/or for licensure decisions. Although much of what follows 
focuses more specifically on CRT use in diagnostic and 
placement decisions, generalizations are easily made to 
other uses. 

CRT's are extremely flexible to use. Items assessing one 
objective may be administered whenever the teacher desires 
and/or several competencies may be assessed in one setting. 
CRT data is useful both in Formative evaluation (information 
obtained on students for the purpose of making 
instructional, diagnostic decisions) and in Summative 
evaluation (end of year data used in making placement 
decisions). This flexibility gives CRT's two primary 
functions. First, short tests (4 to 5 items) may be used for 
estimates of specific objective mastery that may then be 
used in classroom instructional management decisions and 
second, an end of year test based on a collection of items 
may be used for placement decisions. 

District developed CRTs may be printed on machine scorable 
pages and scanned and analyzed either within the building on 
microcomputers or centrally. 
Technological developments in Instructional Management 
Systems (Sheppard, 1986; Witthuhn, 1986) have contributed 
significantly to the use of formal testing in instructional 
decision making by providing immediate turn around of test 
results. This immediate turn around feature used to be 
available only with teacher made tests that were time 
consuming to construct and correct and lacked both 
reliability and validity data. 

The relationship between test items appearing on formative 
measures and items appearing on summative measures 
introduces an unresolved dilemma. If teachers are directed 
to focus instruction on specific objectives for which 
formative measures have been developed and are being used to 
monitor instruction, then should those same items appear on 
the summative measure? Or, should the summative measure 
represent a more global assessment of instruction while 
formative measures assess developmental and/or enabling 
objectives? If one could be sure that each of the parts 
(formative) added up to the whole (summative) then it would 
be safe and reasonable to use formative measures for 
developmental purposes and summative measures for more 
global purposes. For example, reading skill development 
would become the focus of formative evaluation while a more 
holistic measure of reading comprehension would serve as 
the summative measure. This relationship still needs to be 
evaluated before definitive decisions can be reached and it 
is entirely possible that the final decision will vary 
across disciplines and levels. 

CRT's may also be used as pre-tests to provide a survey of 
student skills at the beginning of an instructional 
sequence. CIMS math, for example, makes extensive use of 
survey tests (CIMS, 1986) at the beginning of the year to 
assist in instructional planning. 

Development 

The development of CRT's begins with objectives that have 
been written or selected by district curriculum specialists. 
Objectives are more easily communicated if they are simple 
statements of student outcomes. For example: "Students will 
convert word problems into number sentences." or "Students 
will draw inferences from grade level reading selections." 






This approach is preferred to that of earlier writers (e.g. 
Mager, Popham, etc.) who recommended that each objective 
contain not only the behavior to be achieved but also the 
condition and criterion for demonstrating attainment of that 
behavior. This not only led to extremely awkward and lengthy 
statements but also required several different skills that are 
more appropriately apportioned to different persons. 

Experience has demonstrated that something in the 
neighborhood of 20 well defined objectives at a relatively 
high level of generality is a reasonable number per subject 
per grade level. This translates into 80 test items on a 
summative test if one uses the criterion of 4 test items per 
objective. Each of these higher level 'terminal' objectives 
may be broken down into prerequisite objectives for 
instructional and formative evaluation purposes. 
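
The arithmetic behind the summative test length can be checked with a minimal sketch. The function name and its default values are illustrative assumptions taken from the figures quoted above (about 20 objectives and 4 items per objective); they are not a prescribed procedure.

```python
def summative_test_length(num_objectives: int = 20, items_per_objective: int = 4) -> int:
    """Return the number of items on a summative test that samples every objective."""
    if num_objectives < 0 or items_per_objective < 0:
        raise ValueError("counts must be non-negative")
    return num_objectives * items_per_objective


# 20 objectives sampled with 4 items each.
print(summative_test_length())       # 80
# The same objectives sampled with 5 items each.
print(summative_test_length(20, 5))  # 100
```

Raising the items per objective from 4 to 5 adds 20 items to the summative test, which is one reason districts may hold the number of terminal objectives near 20.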

Once agreement on objectives has been reached, overtly 
observable student responses must be defined for each 
objective. This observable student response is labeled 
Condition and identifies the parameters of a knowledge or 
content Domain from which test items may be drawn to assess 
mastery status of each objective. There are two approaches to 
defining conditions: Deductive and Inductive. 

1. A Deductive approach begins with a narrative 
statement that clearly defines the parameters of a 
domain. The statement must make it possible to 
discriminate between appropriate and inappropriate test 
items. 

Example: 

Objective - (Third grade arithmetic problem solving): 
Students will convert word problems into number 
sentences. 

Condition: Given a three sentence, one step word 
problem requiring either addition or subtraction with 
no extraneous information and single digit values, 
students will select a number sentence representing the 
information presented in the problem. (Note - the 
criteria for determining mastery need not be stated in 
the condition. That is, the number of correct responses 
required for demonstrating mastery need not be included 
at this point.) 

2. The Inductive approach begins with examining a variety 
of test items. These items are divided into two groups: 
those that reflect the intent of the objective and those 
that do not reflect the intent. From these two groups, it 
is then possible to write a condition statement. 

Express Intent          Do Not Express Intent 

   23       96          23 x 46 =      96 x 5 = 
 x 46      x 5 
                        Any item with multiple 
                        choice answers. 

                        86 x 105 = 

Condition: Given multiplication examples in a vertical 
format with two digit multiplicands and one or two digit 
multipliers, students will compute and write the product. 

An alternative condition statement at a more general level 
is: 

Given two digit multiplication examples in either a 
vertical or horizontal format, students will compute and 
either write or select the correct product. 
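
One way to see what a condition statement does is to treat it as a rule that accepts or rejects candidate items. The sketch below is only an illustration: the `Item` fields, the function names, and the numeric ranges (a two digit multiplicand of 10 to 99 and a one or two digit multiplier of 0 to 99, used for both conditions) are assumptions introduced here, not part of the conditions as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    multiplicand: int
    multiplier: int
    layout: str    # "vertical" or "horizontal"
    response: str  # "write" (student writes product) or "select" (multiple choice)


def numbers_in_range(item: Item) -> bool:
    """Assumed reading: two digit multiplicand, one or two digit multiplier."""
    return 10 <= item.multiplicand <= 99 and 0 <= item.multiplier <= 99


def in_specific_domain(item: Item) -> bool:
    """Narrow condition: vertical format, product written by the student."""
    return numbers_in_range(item) and item.layout == "vertical" and item.response == "write"


def in_general_domain(item: Item) -> bool:
    """Broader condition: either format, product written or selected."""
    return (
        numbers_in_range(item)
        and item.layout in ("vertical", "horizontal")
        and item.response in ("write", "select")
    )


candidates = [
    Item(23, 46, "vertical", "write"),     # expresses intent
    Item(96, 5, "vertical", "write"),      # expresses intent
    Item(23, 46, "horizontal", "write"),   # horizontal format
    Item(96, 5, "vertical", "select"),     # multiple choice
    Item(86, 105, "horizontal", "write"),  # three digit multiplier
]
for item in candidates:
    print(item, in_specific_domain(item), in_general_domain(item))
```

Under these assumptions the first two items satisfy both conditions, the horizontal and multiple choice items satisfy only the more general condition, and the item with a three digit multiplier satisfies neither.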

The level of specificity at which a condition is written may 
be a function of whether the test items are to be used for 
formative or summative purposes. Items at the formative 
level might well be more specific. 

Although the step of condition writing must be directed by a 
measurement specialist, it is desirable to involve classroom 
teachers in the process. The range of expectations and 
perspectives offered by classroom teachers with respect to 
the specific population they are dealing with adds 
substantially to final acceptance of the CRT approach. It is 
also desirable to stress that objectives have not been 
written in stone and that in order to develop conditions for 
overtly observable responses, it may be necessary to modify 
some objectives. 

Measurement experts agree that specification of the domain 
from which test items may legitimately be drawn for 
evaluation of objectives is the single most 
important step in the construction of CRT items. Hambleton & 
Novick (1973), for example, state, "If the proper domain of 
test items measuring an objective is not clear, it is 
impossible to select a representative sample of test items 
from that domain." p. 32 

The importance of this step is illustrated in the following 
example: 

Objective: "Students will be able to describe action 
portrayed in a picture." 

Focus of instruction by: 

Teacher #1: Teaches students to isolate details in the 
picture and to relate each of these details in a 
sentence. 

Teacher #2: Teaches students to infer what happened 
immediately prior to and after the picture was taken and 
to describe the hypothetical sequence of events in a 
three sentence paragraph. 

Teacher #3: Teaches students to relate the picture to 
personal experience and to describe how people in the 
picture might feel about what is happening. 

Condition: Given a picture portraying action, students 
will be able to write a three sentence paragraph 
accurately describing the action. 

Although each interpretation of the objective as given is 
accurate, each is different and the "test item" fails to 
assess the focus of instruction by any of the three 
teachers. Because of these very typical differences in 
interpretation of an objective, it is imperative that a 
knowledge domain for each objective be clearly defined. 

Although a substantial proportion of conditions will specify 
traditional paper and pencil type responses, some objectives 
may more appropriately call for student generated oral 
and/or written responses. For example: 

Read fluently a 150 word passage in 2 minutes. 
Pronounce consonant sounds. 
Answer a question in a complete sentence. 

These are called Teacher Certified Competencies and each 
requires clearly defined criteria to discriminate between 
correct and incorrect responses. Experience has demonstrated 
that there will be considerable teacher variability in 
rating student generated responses and teacher certified 
competencies have a potential for extremely low reliability. 
Experience has also shown that these items present excellent 
opportunities for in-service training in which student 
performance standards in response to these open ended items 
are the foci of discussion. 

Once conditions defining the parameters of knowledge/content 
domains for each objective are specified, test items from 
each domain are generated and a random set of these items 
selected for use in assessing student mastery. Test item 
construction is a technical task requiring the assistance of 
a measurement specialist. Both textbook unit tests and 
standardized tests may be used as models for item 
construction. 

Although the number of items needed for an accurate estimate 
of total domain performance can be determined statistically, 
these techniques are beyond the scope of this paper and 
readers are referred to Haladyna & Roid (1983) and Hambleton, 
Craig & Simon (1963) for detailed steps. In actual practice, 
it is common to find 4 to 5 items for each objective. This 
number will vary depending on how broadly or narrowly the 
objective is written. Generally, it is desirable to focus 
each objective on such a narrow range of behaviors that 
variability within the knowledge domain of a given objective 
is minimal. This does, however, raise another dilemma. Too 
narrowly focused objectives raise the issue of Specific 
Sterility and may limit transfer while too broadly defined 
objectives lead to ambiguity in the instructional focus. 
Developers are warned of this problem and encouraged to work 
toward compromises. 
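
The random selection of items from each domain can be sketched as follows. This is a minimal illustration under stated assumptions: the pool contents, the item identifiers, the function name `assemble_form`, and the fixed seed are all hypothetical, and 4 items per objective simply follows the common practice noted above.

```python
import random


def assemble_form(item_pool, items_per_objective=4, seed=None):
    """Draw a random set of items for each objective, without replacement."""
    rng = random.Random(seed)  # a seed makes the draw repeatable for review
    form = {}
    for objective, items in item_pool.items():
        if len(items) < items_per_objective:
            raise ValueError(f"pool for {objective!r} has too few items")
        form[objective] = rng.sample(items, items_per_objective)
    return form


# Hypothetical pool: 12 item identifiers written for each of two objectives.
pool = {
    "word problems to number sentences": [f"WP-{i:02d}" for i in range(1, 13)],
    "two digit multiplication": [f"MU-{i:02d}" for i in range(1, 13)],
}

form = assemble_form(pool, items_per_objective=4, seed=7)
for objective, items in form.items():
    print(objective, items)
```

Drawing with a different seed from the same pool is one simple way to produce an alternate form, provided the pool for each objective is large enough.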

If CRT's are to be used both formatively and summatively 
and/or alternate forms of tests are to be used, it will be 
necessary to develop a pool of several test items for each 
objective. The relationship of formative to summative 
measures, as indicated earlier, will also influence the task 
of test item construction. 

Several test publishers and educational organizations have 
created banks of CRT items that may be purchased once domain 
specifications have been developed. Publishers are often 
anxious to identify appropriate test items for districts 
that have specified their student response conditions. 

Once test items from each domain have been selected or 
created, these test items must be field tested. This leads 
to several technical considerations discussed in the next 
section. 

Validity, Reliability, Discrimination and Degree 
of Difficulty 

Before discussing in detail each of these attributes of CRT 
item construction, it should be understood that standard 
techniques used in norm referenced test construction must be 
modified for CRT use. The primary reason for this is the 
lack of variability in student responses that occurs in 
CRT's. That is, the basic statistical procedures used in 
norm referenced test construction are dependent upon 
substantial variability in student responses to test items. 
In fact, norm referenced test items are constructed so as to 
maximize this variability. Popham & Husek (1969) state that, 
"With criterion-referenced tests, variability is 
irrelevant." p. 3 and Hambleton et al. (1978) concluded 
that, "... classical approaches to reliability and validity 
estimation will need to be interpreted more cautiously (or 
discarded) in the analysis of criterion-referenced tests." 
p. 15 

A second factor impacting on statistical data supporting a 
CRT is the seriousness of decisions reached as a result of 
interpreting scores. The statistical support for a CRT used 
as a high school graduation requirement needs to be far more 
rigorous than for a CRT used during the course of 
instruction for grouping decisions. In the latter case, 
additional information is generally available to the teacher 
and there is ample opportunity to review each decision. 

Validity 

Validity is technically not a characteristic of a 
test but rather a consideration related to the inferences 
drawn from an examinee's test score. That is, a test of two 
digit multiplication is valid for inferences relative to two 
digit multiplication. It is not valid for inferences related 
to either one digit or three digit multiplication. 

Validity has often been described as though there were 
several different forms of validity, e.g. content, construct, 
concurrent, face, logical or predictive. In reality, there 
is only one form of validity and that involves the 
relationship of a decision based on test scores to a true 
state of being. That is, an inference based on test 
performance is being made about the true or real nature of 
an examinee. At best, test performance is only an estimate 
of an individual's capability and validity is an attempt to 
quantify the accuracy of this estimate. There are multiple 
approaches to this quantification process, e.g. content, 
construct, etc. 

CRT scores are generally used for two different kinds of 
inferences (decisions) and therefore require two different 
types of validity. 

The first validity issue is: 'Does the score obtained by a 
student on a sample of test items accurately reflect total 
domain performance?' Stated differently, 'Does student 
performance on the sample of test items provide an accurate 
estimate of the total domain score?' This is called Domain 
Score Validity. The domain of interest may be a single 
objective as illustrated in an earlier example or it may be 
something as extensive as reading comprehension. Domain 
score validity is more of a concern when examining test 
items related to a specific objective, that is, formative 
evaluation. 

The second validity issue is: 'Has the student mastered the 
domain(s) from which test items have been drawn?' Again, 
stated differently, 'Does the test accurately discriminate 
between students who have mastered the objective(s) and 
those who have not mastered the objective(s)?' This is 
called Mastery Status Validity. Mastery status validity is 
more of a concern with summative tests that contain a range 
of domains covered during the year. 

Domain Score Validity may be established by comparing 
student performance on a sample of test items drawn from a 
domain with performance on all test items included within 
that domain. For example, in the objective: "Name the 
numerals 0 - 9", a sample from the domain might include only 
four test items while the entire domain contains ten 
possible responses. A statistical comparison (correlation) 
of student performance on the sample of four items with 
performance on the total domain of ten items would provide a 
statistical value for this domain score validity. This is an 
easy task when the domain is well defined, discrete and 
relatively small. Reasonable limitations are quickly 
exceeded when one considers that the domain for pairs of two 
digit multiplication examples in a vertical format contains 
8,100 possible different responses and that the domain for 
the objective, 'Drawing inferences from a reading 
selection', is indeterminate. It would be impossible to 
assess student performance on the total domain in either of 
these latter examples. 
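
The comparison described above can be illustrated with a short sketch. The response data are made up for illustration, the choice of sampled numerals (1, 4, 6 and 9) is arbitrary, and the Pearson correlation is written out by hand so that nothing beyond the standard library is assumed. The last lines confirm the count of 8,100 pairs of two digit numbers (90 possible multiplicands times 90 possible multipliers).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up data: one row per student, one column per numeral 0-9,
# 1 = numeral named correctly, 0 = not named correctly.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
]
sample_items = [1, 4, 6, 9]  # the four numerals placed on the test

sample_scores = [sum(row[i] for i in sample_items) for row in responses]
domain_scores = [sum(row) for row in responses]
domain_score_validity = pearson(sample_scores, domain_scores)
print(sample_scores, domain_scores, round(domain_score_validity, 3))

# Size of the two digit by two digit multiplication domain.
two_digit_numbers = range(10, 100)
print(len(two_digit_numbers) ** 2)  # 8100
```

With a ten item domain every student can be tested on the whole domain, so the coefficient can actually be computed; with 8,100 possible items, or an indeterminate domain, it cannot.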

In actual practice, domain score validity is most often 
established by a process involving the use of "Judges" who 
examine test items to determine if they accurately reflect 
the domain. Agreement across several judges is used to 
establish 'Face' or 'logical' validity of the domain 
inference. It is assumed that if several judges agree that a 
sample of test items accurately represents the domain, then 
performance on the sample is an accurate estimate of the 
domain score. 

Mastery Status Validity is a far more difficult issue and by 
far the most controversial adaptation of CRT's. Technically, 
mastery of a domain implies that a student will be able to 
respond correctly to every test item from that domain. 
Realistically, because of guessing, lapses in attention, 
carelessness and measurement error, standards for 
determining domain mastery are generally set at something 
less than 100%. It is common to find cut-off scores in the 
70 to 80% range. Several different approaches have been used 
to establish a 'standard' (cut-off score) that accurately 
discriminates between masters and non-masters. Examples 
include: 

1. Use of Judges - Expert judges examine the 
competencies and test items and predict the 
performance of a borderline student on each test item. 
The probabilities of borderline student responses on 
individual items are summed to determine a score for 
the boundary between Mastery and Non-Mastery. Secolsky 
(1987) has, however, demonstrated that there is 
considerable variability in expert judgment. 

The Cut-Off score (standard) arrived at in this manner 
means that the student has mastered what a group of 
judges believes to be the performance of a minimally 
qualified student. The validity statistic is the 
correlation across all judges. 

2. Teacher Prediction of Mastery - Teachers familiar 
with each student predict who the Masters are and 
these predictions are compared with test performance. 
A Cut-Off score is then established to include as 
large a proportion of teacher predicted masters above 
the cut-off as possible and as large a proportion of 
teacher predicted non-masters below the cut-off as 
possible. 

Masters determined by this approach are students whose 
minimum score is like that of successful students in 
past years based solely on teacher judgment. The 
validity statistic is the correlation between teacher 
judgment and student test performance. 

3. Predictive Capacity - CRT performance is correlated 
with some future measure of performance and this 
correlation is used to determine the CRT score 
necessary to ensure future success. 

Masters determined by this approach are students who 
are similar to those who have experienced continued 
success in the past. The validity statistic is the 
correlation between current and future performance. 
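
The first two approaches above can be sketched in code. This is a minimal illustration under stated assumptions, not the procedure of any cited author: the function names, the use of the mean across judges, and the rule of maximizing simple agreement with teacher predictions are all hypothetical choices made for the example.

```python
def judge_cutoff(ratings):
    """Approach 1 (assumed form): each judge supplies, per item, the
    probability that a borderline student answers correctly. Each
    judge's probabilities are summed, then averaged across judges."""
    per_judge = [sum(item_probs) for item_probs in ratings]
    return sum(per_judge) / len(per_judge)


def teacher_cutoff(scores, predicted_master):
    """Approach 2 (assumed form): choose the observed score that, used
    as a cut-off, agrees with the most teacher predictions. A student
    is treated as a master when score >= cut-off."""
    best_cut, best_agree = None, -1
    for cut in sorted(set(scores)):
        agree = sum((s >= cut) == m for s, m in zip(scores, predicted_master))
        if agree > best_agree:  # ties keep the lowest such cut-off
            best_cut, best_agree = cut, agree
    return best_cut


# Two judges rating a four-item test.
ratings = [[0.5, 0.75, 1.0, 0.75], [0.5, 0.5, 1.0, 1.0]]
print(judge_cutoff(ratings))

# Six students with teacher predictions of mastery.
scores = [4, 5, 6, 7, 8, 9]
predicted = [False, False, False, True, True, True]
print(teacher_cutoff(scores, predicted))
```

Different judges or different teachers will generally yield different cut-offs from the same test, which is the variability the text describes.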

Each of these standard setting procedures will produce a 
different cut-off score and there is no substantive defense 
for use of any of these procedures. The use of an absolute 
cut-off score, that is, a pre-determined score for use in 
determining mastery status, can only create controversy. The 
development and use of a relative cut-off score will be 
discussed later. 

Regardless of where a standard is set, there will be errors 
created because test items do not adequately reflect the 
domains assessed (validity) and because student performance 
on the test varies from one occasion to the next 
(reliability). The greatest number of errors will occur 
closest to the cut-off score regardless of which standard 
setting procedure is used. That is, students scoring just 
below or just above the cut-off score are most likely to be 
misclassified. These are true masters who score below the 
cut-off and true non-masters who score above the cut-off. 
(Fig. 1) 

Fig 1. 
Although a number of rather complex statistical procedures 
have been put forth in an attempt to reduce these errors 
(Hambleton, 1978; Hambleton & De Gruijter, 1983; Haladyna & 
Roid; Linn, 1978; Shepard, 1980), they are generally 
overly cumbersome for the results obtained and each can do 
little more than reduce the number of errors. The errors 
cannot be eliminated. 

The standard setting limitation of CRT use has received 
substantial criticism and has led at least one leading 
measurement expert (Glass, 1978) to conclude that CRT's 
should not be used for mastery/non-mastery classifications 
but instead should be used only to determine if the rate of 
learning goes up or down. Educational placement decisions 
are then attached to a rate of learning interpretation. A 
host of other experts, on the other hand (Berk, 1980; 
Hambleton, 1978; Popham & Husek, 1969 and Shepard, 1980), 
offer that the arbitrary standard imposed by CRT's is better 
than no standard, and certainly the achievement gains 
attributed to CRT use (Abrams, 1985; Guskey, et al., 1986 
and Fuchs, et al., 1986) suggest that there is substantial 
value in the use of CRT driven instruction. 

An alternative to use of an absolute cut-off score for 
determining mastery status that has significant potential 
for reducing misclassification errors has been introduced 
by Lathrop (1986). His approach calls for two cut-off scores 
with an 'uncertain' area between these two scores. (Fig. 2) 
Students above the upper cut-off score are clearly masters 
while students below the lower score are clearly 
non-masters. Decisions on students in the 'uncertain' area 
can be based on a variety of data. Lathrop recommends 
additional testing. If, however, formative evaluation during 
the year has been recorded, this would appear to have great 
value in determining mastery/non-mastery status of students in 
the 'uncertain' area. 
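
A minimal sketch of the two cut-off classification follows. The function name and the handling of scores that fall exactly on a boundary are assumptions made for illustration; the text specifies only that scores above the upper cut-off indicate mastery and scores below the lower cut-off indicate non-mastery.

```python
def classify(score, lower, upper):
    """Three-way classification with an 'uncertain' band.
    Assumption: a score equal to the upper cut-off counts as mastery,
    and a score equal to the lower cut-off falls in the uncertain band."""
    if lower > upper:
        raise ValueError("lower cut-off must not exceed upper cut-off")
    if score >= upper:
        return "master"
    if score < lower:
        return "non-master"
    return "uncertain"


# Percent-correct scores with cut-offs at 70 and 80.
for s in (85, 75, 60):
    print(s, classify(s, lower=70, upper=80))
```

Students returned as "uncertain" would then be resolved with additional testing or recorded formative evaluation data, as described above.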

Fig. 2. Cut-off Score Range 
This approach has significant potential for use in 
situations where CRT's are being used for grade level 
promotion, particularly in view of recent evidence that 
straight non-promotion appears to be counterproductive 
(Peterson, et al., 1987; Holmes & Matthews, 1984). That is, 
students failing to meet minimum competency requirements 
seldom benefit from repeating an already proven failure 
experience. Although there are students in need of 
remediation, that remediation is more profitably presented 
in the form of an alternative, and the specific alternative 
required could more easily be determined given CRT 
performance data. 

Relative Cut-Off Score: If one accepts the assumption that 
educational outcomes are at least partially a function of 
available resources and that educational resource 
allocations are not made based on educational need but 
rather on political and economic considerations, then it 
follows that resource allocation should serve as the basis 
for mastery/non-mastery decisions. The effectiveness of any 
schooling organization in producing specific outcomes is 
partially related to financial resources allocated by a 
society that is more concerned with lower tax rates than 
with mastery. Non-masters, i.e. students in need of 
additional assistance, are determined, therefore, not on the 
basis of any absolute standard but rather on the basis of 
resource availability. This leads to setting standards that 
permit a known percentage of students to receive special 
assistance. 
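
As a rough illustration of a relative standard, the sketch below derives a cut-off from the fraction of students that available resources can serve. The function name, the rounding down of the number served, and the treatment of tied scores are assumptions for the example, not a documented state procedure.

```python
import math


def relative_cutoff(scores, fraction_served):
    """Return the highest score among the lowest-scoring students who
    can be served. Students scoring at or below this value receive
    assistance. Assumption: the number served is rounded down; tied
    scores at the cut-off may push the count slightly above it."""
    if not 0 <= fraction_served <= 1:
        raise ValueError("fraction_served must be between 0 and 1")
    n_served = math.floor(len(scores) * fraction_served)
    if n_served == 0:
        return None  # resources serve no one; no cut-off is defined
    return sorted(scores)[n_served - 1]


# Resources allow assistance for one quarter of eight students.
scores = [55, 60, 62, 70, 75, 80, 90, 95]
print(relative_cutoff(scores, 0.25))
```

Note that the resulting cut-off moves whenever the score distribution or the resource level changes, which is what distinguishes it from an absolute standard.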

This 'Relative Standard' approach is used by the State 
Education Department of New York in setting boundaries for 
remedial emphasis on PEP tests as well as its use of 
resources based on MR reports. The lowest portion of a 
population is served based on the assumption that the entire 
system will best be served by focusing its use of limited 
resources on those in greatest need. There has been no 
attempt to set an 'absolute value' as the target for either 
individual students or school systems on New York State 
mandated tests. Such a target could only be arbitrary and 
would be subject to continual controversy. 

Given this 'relative cut-off score' argument, it becomes 
obvious that CRT application in grade level promotional 
decisions is currently the only truly defensible use of 
CRT's. Given the limitations of arbitrarily established 
absolute cut-offs, CRT's have limited value in setting 
graduation and licensure decisions. 

Reliability: Unlike validity, reliability is a test 
characteristic. Reliability is an index of the instrument's 
(test's) ability to repeatedly produce the same result. 
Reliability for norm-referenced tests is based on how 
closely the same score for each individual can be 
replicated. The closer a test comes to repeating the same 
score for each examinee on two different administrations, 
the higher the reliability. In CRT's, however, one is 
concerned only with the test's ability to replicate the 
mastery/non-mastery distinction. This is a qualitative 
decision compared with the quantitative decision required in 
norm-referenced tests. 

Reliability is calculated for norm-referenced tests by 
administering alternate forms of the test to the same 
population and then correlating scores of individuals on the 
two administrations. A second approach compares the score on 
one-half of the test with a score on the other half 
(split-half or internal consistency reliability). Because 
CRT reliability is concerned only with replicating the 
mastery/non-mastery distinction, a slightly different 
computational procedure is applied with either the alternate 
forms or split-half approach. Although it is generally 
recognized that limited student response variability on 
CRT's will result in lower reliability estimates, Kane 
(1986) has demonstrated that reliabilities below .50 must be 
viewed with caution. 
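
The text does not specify the computational procedure used for CRT reliability. One commonly used option, shown here as an assumption rather than as the author's method, is the proportion of examinees classified the same way on two forms, together with Cohen's kappa to correct that agreement for chance. The function name and sample data are hypothetical.

```python
def classification_consistency(form_a, form_b, cutoff):
    """Return (p_o, kappa) for mastery decisions on two forms.
    p_o is the observed proportion of consistent decisions; kappa
    adjusts p_o for the agreement expected by chance alone."""
    if len(form_a) != len(form_b) or not form_a:
        raise ValueError("forms must be non-empty and of equal length")
    a = [s >= cutoff for s in form_a]
    b = [s >= cutoff for s in form_b]
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")
    return p_o, kappa


# Eight examinees on alternate forms, cut-off of 75.
form_a = [90, 85, 60, 75, 50, 95, 65, 80]
form_b = [88, 70, 65, 78, 55, 92, 72, 82]
print(classification_consistency(form_a, form_b, cutoff=75))
```

The same function can be applied to split-half scores by passing the two half-test scores, with the cut-off rescaled to the half-test length.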

Discrimination is an item characteristic rather than a test 
characteristic. A discrimination value describes the 
frequency that masters respond correctly to a single test 
item and that non-masters respond incorrectly. Items that 
non-masters get correct as often as masters do fail to 
discriminate between the two populations. The 
discrimination index is a classical test item characteristic 
that is applicable to CRT items. Modification in 
interpretation is, however, required. Normally, in 
norm-referenced testing, one looks for item discrimination 
indices of .30 and above. That is, the proportion of high 
scoring students producing a correct response exceeds the 
proportion of low scoring students getting that item correct 
by .30 or more. Although this is still a good general rule of 
thumb to follow in CRT development, it is also important to 
examine each test item carefully to determine how the item 
relates to the mastery distinction. Given the criteria of 
item value, items with lower discrimination indices may be 
retained in CRT applications. 
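
One common form of the discrimination index is the difference between the proportion of masters and the proportion of non-masters answering an item correctly. The sketch below assumes that form; the function name and data are hypothetical.

```python
def discrimination_index(item_correct, is_master):
    """Proportion of masters answering the item correctly minus the
    proportion of non-masters answering it correctly. Ranges from
    -1 to 1; 0 means the item does not separate the two groups."""
    masters = [c for c, m in zip(item_correct, is_master) if m]
    non_masters = [c for c, m in zip(item_correct, is_master) if not m]
    if not masters or not non_masters:
        raise ValueError("both groups must contain at least one examinee")
    return sum(masters) / len(masters) - sum(non_masters) / len(non_masters)


# 1 = correct, 0 = incorrect, for four masters then four non-masters.
item_correct = [1, 1, 1, 0, 1, 0, 0, 0]
is_master = [True, True, True, True, False, False, False, False]
print(discrimination_index(item_correct, is_master))
```

An item falling below the .30 guideline would be flagged for review against the mastery distinction rather than discarded automatically.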

Degree of Difficulty is another test item characteristic. 
The degree of difficulty describes how often examinees are 
likely to respond correctly to the item. This is another of 
the classical test item characteristics that is applicable 
to CRT items. A degree of difficulty may be computed for 
each test item. Although traditionally a 50% error rate is 
considered desirable in norm-referenced test construction, 
CRT's tend to focus on a degree of difficulty that is more 
sensitive near the cut-off score. That is, individual test 
item difficulty should be determined in conjunction with the 
cut-off score. If the final cut-off range is in the 70% to 
80% area, then item difficulty should be set near these 
values. 
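
A short sketch of the difficulty computation follows, taking difficulty as the proportion of examinees answering correctly. Treating 'near the cut-off' as falling inside the 70% to 80% range is an assumption for illustration, as are the function names.

```python
def item_difficulty(item_correct):
    """Proportion of examinees answering the item correctly
    (1 = correct, 0 = incorrect)."""
    if not item_correct:
        raise ValueError("no responses supplied")
    return sum(item_correct) / len(item_correct)


def near_cutoff(p, low=0.70, high=0.80):
    """Assumed screening rule: keep items whose difficulty falls
    inside the cut-off range."""
    return low <= p <= high


responses = [1, 1, 1, 0]
p = item_difficulty(responses)
print(p, near_cutoff(p))
```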

Field Testing 
Each of the CRT statistics described above must be 
established with a sample of the population for whom the 
test is intended. The determination of a sufficient sample 
is often a dilemma. The more closely a sample approximates 
that of the total population, the more accurate are the 
estimates of population characteristics. There is, however, a 
point of diminishing returns at which increasing sample size 
contributes only minimally to the accuracy of the 
statistical estimates. My own rule of thumb is the larger of 
20% or 100 examinees with a representative cross section of 
the total population. 
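
The rule of thumb can be written as a one-line calculation. The function name is hypothetical, and capping the result at the population size (for populations under 100) is an added assumption not stated in the text.

```python
def field_test_sample_size(population_size):
    """Larger of 20% of the population or 100 examinees, capped at
    the population size (the cap is an assumption). Integer ceiling
    division avoids floating point rounding."""
    if population_size < 0:
        raise ValueError("population_size must be non-negative")
    twenty_percent = (population_size + 4) // 5  # ceiling of 20%
    return min(population_size, max(twenty_percent, 100))


for n in (50, 300, 1000):
    print(n, field_test_sample_size(n))
```

The calculation gives only the count; drawing a representative cross section still requires sampling across the relevant subgroups of the population.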

Individual test items are examined for both degree of 
difficulty and discrimination characteristics. Groups of 
items to be administered as a 'Test' are examined for 
reliability. Because reliability is a function of test 
length, short tests to be used with only one or two 
objectives are not subjected to reliability analyses. 

It may be necessary to field test CRT's on repeated 
occasions if initial use produces undesirable statistical 
properties. The possibility of repeated testing should be 
built into any plan for development of a CRT. It is also 
desirable to use up to two or three times as many items as 
needed in field testing in recognition of the fact that some 
items will be discarded because of their statistical 
properties. 
Administration 

Test items will be packaged for administration much like 
existing standardized tests. In fact, existing standardized 
test forms may be used as models for the development of 
CRT's. 

Summative CRT's are administered at the end of the 
instructional process in a secure testing situation similar 
to standardized test administrations. Frequency of summative 
testing for critical decision making, i.e. placement of 
individual students, is an issue to be considered. Will major 
placement decisions occur at every grade level or only at 
specific grade levels where remedial resources are 
concentrated for more effective utilization? The frequency 
of summative testing and critical decision points will 
depend on the availability of remedial resources. Districts 
have used a variety of approaches to establishing gates for 
uninterrupted promotion, e.g. gates at grades K, 2, 5 and 7; 
every grade level; grade four only, etc. 

When to begin summative testing is yet another issue. 
Developmentalists and proponents of Whole Language will 
argue that formal testing should not begin until grades 3 or 
4. Measurement specialists and proponents of accountability 
will argue that formal testing should begin in Kindergarten. 
This decision will depend on value judgments within the 
district and will likely vary from district to district. 

Administration of formative CRT's will depend on several 
factors. If curricular objectives are closely aligned with a 
specific textbook or program sequence, then CRT's may be 
administered as unit tests. If curricular objectives are 
independent of program, i.e. teachers have the freedom to 
use any program they wish to achieve course objectives, then 
formative CRT's may be administered either as each objective 
is taught or at given intervals, e.g. five weeks, quarterly, 
etc. 

Management 

Although formative CRT's are designed primarily for teacher 
use in instructional decision making within individual 
classrooms, some form of district-wide management system can 
increase the effectiveness of formative CRT's. Distribution, 
scoring and record keeping of formative CRT's can become a 
logistical nightmare unless a management system is created. 
A management system involves grouping of test items into 
some type of package so that teachers are not pulling out 
one set of 4 test items each time an objective is completed. 
Packaging test items implies some form of either Pacing (the 
rate at which instruction is to occur) or Sequencing (the 
order in which objectives are to be presented). Because 
packaging test items and setting a testing schedule reduces 
teacher flexibility, administrators will be wise to discuss 
the trade-offs with teachers. 

The administration of formative CRT's will also depend on 
the nature of course objectives; that is, are objectives 
developmental or at higher, more generalizable levels. 
Developmentally stated objectives will require more testing 
than a few general level objectives. 

CRT's at the primary level present a problem in that 
students at this level cannot be expected to transfer 
answers to machine scorable response sheets. Either hand 
scoring of tests must be done or special forms must be 
printed to accommodate student responses on the test pages. 
The number of test items that can be accommodated on each 
page for testing at the primary level is limited thereby 
creating a logistical problem for storage and distribution 
of test forms. NCS (Sheppard, ISBS) has developed a generic 
machine scorable response sheet that facilitates immediate 
turn around for test results within the building through use 
of scanners, microcomputers and printers at the building 
level. These generic response sheets can be used at any 
level. 

Results of CRT Application 
In 1978, Hambleton (1978) stated that in theory, "... 
objectives based programs were designed to: 

1. define curricula in terms of objectives. 

2. use these objectives to drive instruction. 

3. provide on-going evaluation data for instructional 
decisions." (p. 280) 

He also concluded that hard evidence in support of this 
theory was in short supply. 

Today, there appears to be no shortage of evidence to 
support the use of CRT technology in conjunction with 
objectives based programs. Research findings (Abrams, 1985; 
Barber, 1979; Conner, et al., 1985; Conyers, et al., 1985; 
Fuchs, et al., 1986; Guskey and Gates, 1986; Hyman, 1979 
and Mevarech, 1985) report that: 

1. instruction directed at specifically defined behaviors 
(objectives/competencies) is far more effective than 
global instruction. 

2. frequent, formal monitoring of students with CRT's 
aligned with curricular objectives/competencies is 
superior to teacher judgment of student progress. 

3. student progress tied directly to specific objectives 
encourages more effective utilization of instructional 
resources. 

4. parent involvement increases with the precise 
information made available in objectives based programs 
accompanied by frequent CRT assessment and reporting of 
progress. 

5. student knowledge of progress, accomplishments and 
expectations is increased with objectives based programs 
and CRT assessment. 

These conclusions are derived from research summaries, 
meta-analysis of research reports and individual district 
summaries of student gains. There appears to be little doubt 
that the formative use of CRT's in well defined objectives 
based programs contributes to student achievement gains as 
measured by a variety of indices including the traditional 
'Norm-Referenced' standardized test. 

Not all reviews are so overwhelmingly positive. A dissenting 
review by Slavin (1987) concludes that the claims of 
Mastery Learning proponents are highly exaggerated. His 
analysis of highly selective studies reveals at best very 
minor gains and he questions whether these gains are the 
result of more time on task for some students or the 
monitoring function. Obviously, frequent monitoring has led 
to more efficient remediation and the recent development of 
more alternatives in remedial efforts has increased their 
effectiveness. Whether increased student gains come from 
more time on task or frequent monitoring seems irrelevant. 
The fact is that frequent monitoring with formal tests has 
led to increased achievement of students. 